DAssd-Net: A Lightweight Steel Surface Defect Detection Model Based on Multi-Branch Dilated Convolution Aggregation and Multi-Domain Perception Detection Head

During steel production, various defects often appear on the surface of the steel, such as cracks, pores, scars, and inclusions. These defects may seriously degrade steel quality or performance, so detecting them in a timely and accurate manner is of great technical significance. This paper proposes DAssd-Net, a lightweight model for steel surface defect detection based on multi-branch dilated convolution aggregation and a multi-domain perception detection head. First, a multi-branch Dilated Convolution Aggregation Module (DCAM) is proposed as a feature learning structure for the feature augmentation network. Second, to better capture spatial (location) information and suppress channel redundancy, we propose a Dilated Convolution and Channel Attention Fusion Module (DCM) and a Dilated Convolution and Spatial Attention Fusion Module (DSM) as feature enhancement modules for the regression and classification tasks in the detection head. Third, experiments and heat map visualization show that DAssd-Net enlarges the receptive field of the model while attending to target spatial locations and suppressing redundant channel features. DAssd-Net achieves 81.97% mAP accuracy on the NEU-DET dataset with a model size of only 18.7 MB. Compared with the latest YOLOv8 model, mAP is increased by 4.69% and the model size is reduced by 23.9 MB, giving DAssd-Net a clear lightweight advantage.


Introduction
Steel accounts for more than 90% of all metals used in industrial production because it combines high strength, ductility, and excellent manufacturability at low cost [1], making it ideal for machines, civil structures, transportation equipment, and an endless list of tools [2]. As a metallic material [3], steel is widely used in manufacturing processes such as brazing [4][5][6], laser welding [7,8], and additive manufacturing [9]. Surface defects in steel are often related to microstructure changes during fabrication [10], and the interaction between alloying elements and microstructure can affect the formation of surface defects, thereby affecting the organization and mechanical properties of steels [11,12]. Steel surface defects include cracks, bubbles, inclusions, scars, and scratches. These defects reduce the strength, toughness, and ductility of the material, thereby shortening the service life and degrading the safety performance of steel structures; they can also impair appearance quality, alter product dimensions, and destabilize performance [2]. By detecting and analyzing defects on the steel surface, potential safety hazards can be discovered in time, providing a basis for quality control and improvement in the production process.
Based on the above review, this paper addresses the technology gap by proposing an innovative model, DAssd-Net, for steel surface defect detection. The model fully considers the size of the receptive field as well as the spatial and channel information of the feature map. Compared with other mainstream target detection models, DAssd-Net achieves 81.97% mAP accuracy on the NEU-DET dataset with a model size of only 18.7 MB. Its mAP is 4.69% higher than that of the latest YOLOv8 model, while its model size is 23.9 MB smaller.

Image Processing-Based Detection Method
Using traditional image processing methods to deal with steel surface defects usually requires the manual selection of parameters (such as thresholds) and algorithms [51], making it difficult to automatically adapt to the characteristics and needs of different images. Based on the characteristics of steel surface defects, some scholars [52,53] designed or improved classic operators to raise detection accuracy. However, these methods cannot handle the noise and distortions present in the images, degrading image quality after processing. To extract the features of steel surface defects more accurately, some studies [54,55] have designed more complex feature-extractors by combining multiple methods. These feature-extractors improve the extraction of defect features and provide useful assistance for subsequent detection. However, they usually require many calculations, process slowly, and are difficult to run in real time.

Deep Learning-Based Detection Method
Traditional image processing methods require the manual design of feature-extractors, while deep learning methods can automatically learn features from data, thereby avoiding manual feature-extractor design [56,57]. An increasing number of deep learning-based defect detection methods have been applied to steel materials. In actual steel surface defect detection, defects vary widely in size and shape, and object backgrounds and lighting environments are complex.
For small-sized defects, surface texture and color changes are also relatively small, so it is difficult to extract distinguishing features. Studies [58,59] have improved the representation of the model for small defects by designing feature-enhancement modules and making use of multi-scale features, which has been shown to effectively handle abundant texture variations and small-sized defects on the target surface. Other studies [60,61] have improved the accuracy of defect detection by fusing feature maps of different levels and sizes; multi-scale features prove robust to the diversity of defect shapes. In further studies [62][63][64][65], researchers have obtained full-scale features to detect defects of multiple scales using a multi-scale feature fusion network.
In defect detection, it is necessary to focus on the key areas of randomly occurring defects while reducing the interference of background and illumination changes. Researchers [66] have proposed an Adaptive Graph Channel Attention (AGCA) module to improve feature representation ability. Study [67] uses a Channel Attention Module (CAM) and a Bidirectional Feature Fusion (BFFN) module to fully fuse features, showing that the combination of the two can reduce the impact of complex environments. Study [68] uses the coordinate attention (CA) module, which improves the ability of the network to locate defects.
From the above survey, it can be found that there is a gap in handling the scale and location distributions of steel surface defects; defects with extreme scale distributions (such as tiny and very large defects) cannot be detected reliably. In addition, devices with limited computing resources often have small storage capacity and computing power, and the design of some models ignores model size and inference speed, making it difficult for them to run on hardware with limited computing power. Therefore, to improve both the lightweight design and the ability to detect defects of different scales, this study proposes DAssd-Net, a model for steel surface defect detection.

Steel Surface Defect Detection Network DAssd-Net
We will introduce the overall framework of the proposed steel surface defect detection model based on dilated convolution and attention fusion modules. We will then introduce the model substructures, including the Dilated Convolution Aggregation Module (DCAM), Dilated Convolution and Channel Attention Fusion Module (DCM), and Dilated Convolution and Spatial Attention Fusion Module (DSM).

Overall Network Architecture
Designing a specific object detection model usually requires analyzing the distribution of sample boxes in the dataset. The center point coordinates of the sample boxes will be used to indicate the relative position of the defect examples in the image, while the width and height of the sample boxes will be used to indicate the size distribution of the defect examples. By analyzing the distribution of the center point and size of the detection boxes, a detection model tailored to the dataset is designed.
The overall structure of the proposed steel surface defect detection network DAssd-Net is shown in Figure 1. We use the lightweight model MobileNetv2 as the backbone network for feature extraction from steel surface defect images. To reduce redundancy and increase the size of the receptive field of the network, an FPN structure composed of DCAM modules is proposed for feature fusion. To improve the network's attention to the important areas and channels of the image or feature map, we integrate the DSM and DCM modules into the detection head network, enabling it to identify defect categories and locations.

Dilated Convolution Aggregation Module (DCAM)
We collect the normalized x, y, width, and height information of all label boxes in the dataset and compute the distribution of their center points and sizes. In Figure 2, most of the center points of the label boxes lie at the center of the y direction (y = 0.5), and the scale (i.e., height and width) distribution is mostly concentrated on small sizes, although there are also many large-scale label boxes. Due to the limited receptive field of ordinary convolution, it is difficult to detect large objects in the image. Dilated convolution is used to increase the receptive field without increasing the number of parameters; the enlarged receptive field helps to detect large targets. When different dilation rates are selected, receptive fields of different sizes are obtained, along with multi-scale information.
The size of the two-layer convolution kernel is 3 × 3, and the size of the receptive field after stacking ordinary convolutions with a stride of one is 5 × 5, as shown in Figure 3a. When the dilation rates are (3,5), the size of the receptive field is 17 × 17. From Figure 3b, it can be found that there is a lack of correlation between the convolution results of this layer, resulting in the loss of local information and the gridding effect. To solve this problem, the Hybrid Dilated Convolution (HDC) [69] criterion is used to design the dilation rates. We set the dilation rates to (1,3,5), giving a receptive field of 19 × 19, as shown in Figure 3c, to avoid the gridding effect. Assuming the kernel size of the dilated convolution is $k$ and the stride is one, the equivalent kernel size $k'$ and the receptive field $rf_{i+1}$ of the dilated convolution at layer $i+1$ are:

$$k' = k + (k - 1)(d - 1)$$

$$rf_{i+1} = rf_i + (k' - 1)$$

where $k$ is the size of the convolution kernel, $k'$ is the size of the equivalent convolution kernel, $d$ is the dilation rate, and $rf$ is the size of the receptive field.
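As a quick check of these formulas, the short sketch below (a hypothetical helper, not code from the paper) computes the stacked receptive field for the dilation-rate choices discussed above:

```python
def equivalent_kernel(k: int, d: int) -> int:
    # k' = k + (k - 1)(d - 1)
    return k + (k - 1) * (d - 1)

def receptive_field(kernel: int = 3, dilations=(1, 3, 5)) -> int:
    # Stacked 3x3 dilated convolutions with stride 1:
    # rf_{i+1} = rf_i + (k' - 1), starting from a single pixel (rf = 1).
    rf = 1
    for d in dilations:
        rf += equivalent_kernel(kernel, d) - 1
    return rf

print(receptive_field(dilations=(3, 5)))     # 17 -> prone to the gridding effect
print(receptive_field(dilations=(1, 3, 5)))  # 19 -> satisfies the HDC criterion
```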
We have adopted a structure similar to the Receptive Field Block (RFB) [70], but due to the small size of the dataset images, we have redesigned a more streamlined Dilated Convolution Aggregation Module (DCAM). Specifically, we designed two dilated convolution branch structures, concatenated the branches, and finally eliminated the gridding effect through a 3 × 3 convolution with a dilation rate of 1. This design ensures the acquisition of local detailed information. The 1 × 1 convolution in DCAM is mainly used to adjust the number of channels.
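The following PyTorch sketch illustrates one plausible reading of the DCAM design; the branch widths and layer ordering are our assumptions, since the paper specifies only the two dilated branches, the concatenation, and the final dilation-1 convolution:

```python
import torch
import torch.nn as nn

class DCAM(nn.Module):
    """Sketch of the Dilated Convolution Aggregation Module (details assumed)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Two dilated branches; 1x1 convolutions adjust the channel count.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=1),
            nn.Conv2d(out_ch // 2, out_ch // 2, kernel_size=3, padding=3, dilation=3),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=1),
            nn.Conv2d(out_ch // 2, out_ch // 2, kernel_size=3, padding=5, dilation=5),
        )
        # Final 3x3 convolution with dilation 1 suppresses the gridding effect.
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, dilation=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.fuse(y)

# e.g. DCAM(256, 256)(torch.randn(1, 256, 40, 40)).shape -> (1, 256, 40, 40)
```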

Dilated Convolution and Channel Attention Fusion Module (DCM)
In the CNN-based target detection model, to provide more information (such as color, texture, and shape) and better representation capabilities, a multi-channel design (such as a channel number of C = 256) is usually used. However, in the actual model, there may be a certain correlation between different channels, and there may be some redundant information. Channel attention is used to select important information and suppress redundant information [47,48]. As shown in Figure 4, we have designed the Dilated Convolution and Channel Attention Fusion Module (DCM), which fuses channel attention and dilated convolution to improve the representation and generalization of the model.

A given feature map, $I_c \in \mathbb{R}^{H \times W \times C}$, passes through two branches: the channel attention branch and the dilated convolution branch. The channel attention branch adaptively adjusts the weight of each channel according to its importance, in order to better capture the salient features in the input data. A channel-by-channel global average pooling operation compresses the spatial dimensions of $I_c$ to obtain the global inter-channel information $z \in \mathbb{R}^{1 \times 1 \times C}$, whose $c$-th channel is:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} I_c(i, j)$$

where $I_c(i, j)$ denotes the value of channel $c$ at spatial location $(i, j)$. A one-dimensional convolution kernel is then used to learn the dependencies between channels, and the weights are normalized through the sigmoid function, producing the channel feature map $s \in \mathbb{R}^{1 \times 1 \times C}$:

$$s = \sigma(\mathrm{Conv1d}(z))$$

where $\mathrm{Conv1d}$ indicates a one-dimensional convolution [71] with kernel size equal to three and padding equal to one, and $\sigma$ represents the sigmoid function. Finally, the feature map $I_c$ is multiplied elementwise by the normalized channel weight $s$ to obtain the output channel feature map $O_c \in \mathbb{R}^{H \times W \times C}$:

$$O_c = s \otimes I_c$$
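A minimal PyTorch sketch of the DCM channel attention branch follows; the fusion with the dilated branch (here a single dilated 3 × 3 convolution added to the reweighted features) is an assumption, as the paper does not spell out the combination operator:

```python
import torch
import torch.nn as nn

class DCM(nn.Module):
    """Sketch of DCM: channel attention fused with a dilated convolution branch."""
    def __init__(self, channels: int, dilation: int = 3):
        super().__init__()
        # 1D convolution over the channel descriptor (kernel 3, padding 1), as described.
        self.conv1d = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
        self.dilated = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=dilation, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                            # global average pooling -> (B, C)
        s = torch.sigmoid(self.conv1d(z.unsqueeze(1))).squeeze(1)  # channel weights s
        o = x * s[:, :, None, None]                       # O_c = s (x) I_c, elementwise
        return o + self.dilated(x)                        # fusion by summation (assumed)
```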

Dilated Convolution and Spatial Attention Fusion Module (DSM)
In steel surface defect detection, different types of defects often show different forms. For example, crazing defects often have a wavy texture shape, and inclusion defects often have irregular oval shapes. In addition to shape, different types of defects have different sizes: pitted-surface defects are often large, while rolled-in scale defects are often small. To better focus on the area of each defect's shape and size, as shown in Figure 5, we propose the Dilated Convolution and Spatial Attention Fusion Module (DSM). The module focuses on local areas in the image, thereby improving the perception of local features while increasing the receptive field.

A given feature map $I_s \in \mathbb{R}^{H \times W \times C}$ goes through the spatial attention branch and the dilated convolution branch, where $I_s$ in the latter is processed by dilated convolutions with dilation rates (1,3,5). First, channel-based global average pooling and global max pooling produce $g_s \in \mathbb{R}^{H \times W \times 1}$ and $m_s \in \mathbb{R}^{H \times W \times 1}$, respectively:

$$g_s(i) = \frac{1}{C} \sum_{c=1}^{C} I_C(i)_c, \quad m_s(i) = \max_{c = 1 \cdots C} I_C(i)_c$$

where $I_C(i) \in \mathbb{R}^C$, $i = 1 \cdots H \times W$, represents the set of all channel values at each spatial pixel. Thus, $g_s$ is the average over the channel set at each spatial pixel, and $m_s$ is the maximum over that set. Then, $g_s$ and $m_s$ are concatenated along the channel direction to obtain $y_s \in \mathbb{R}^{H \times W \times 2}$. After $y_s$ undergoes a 2D convolution with a kernel size of three, the dimension is reduced to one channel, and the spatial attention feature map $c \in \mathbb{R}^{H \times W \times 1}$ is generated through the sigmoid function:

$$c = \sigma(\mathrm{Conv2d}(\mathrm{Concat}(g_s, m_s)))$$

where $\mathrm{Concat}$ represents the channel-wise concatenation operation, $\mathrm{Conv2d}$ represents the 2D convolution operation, and $\sigma$ represents the sigmoid function. The feature map and the normalized spatial weight are multiplied elementwise to obtain the output feature map $O_s \in \mathbb{R}^{H \times W \times C}$:

$$O_s = c \otimes I_s$$
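The spatial attention branch above can be sketched in PyTorch as follows; the serial stacking of the (1,3,5) dilated convolutions and the final elementwise product follow the text, while the layer widths are assumptions:

```python
import torch
import torch.nn as nn

class DSM(nn.Module):
    """Sketch of DSM: dilated convolutions (rates 1, 3, 5) plus spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.dilated = nn.Sequential(*[
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5)
        ])
        # 3x3 convolution reduces the 2-channel (avg, max) map to one channel.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        f = self.dilated(x)                                # I_s after the dilated branch
        g = f.mean(dim=1, keepdim=True)                    # g_s: channel-wise average
        m, _ = f.max(dim=1, keepdim=True)                  # m_s: channel-wise maximum
        c = torch.sigmoid(self.conv(torch.cat([g, m], dim=1)))  # attention map c
        return f * c                                       # O_s = c (x) I_s
```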
Dataset
This paper uses the public steel surface defect detection dataset NEU-DET [72], which includes a total of 1800 images covering six types of defects, with 300 images per defect type, each 200 × 200 pixels in size. For convenience, we abbreviate the six defect types as C (crazing), RS (rolled-in scales), I (inclusion), P (patches), PS (pitted surface), and S (scratches).
Through formula (13), the dataset labels in upper-left/lower-right corner format $(x_{min}, y_{min}, x_{max}, y_{max})$ are converted into normalized center point and width-height format $(x_c, y_c, w, h)$:

$$x_c = \frac{x_{min} + x_{max}}{2 \times width}, \quad y_c = \frac{y_{min} + y_{max}}{2 \times height}, \quad w = \frac{x_{max} - x_{min}}{width}, \quad h = \frac{y_{max} - y_{min}}{height} \tag{13}$$

where $width$ is the width of the image and $height$ is the height of the image. Figure 6 describes the pairwise relationships between the four attributes x, y, width, and height: the diagonal shows the histogram (distribution) of each attribute, and the off-diagonal cells show the correlation between pairs of attributes. From Figure 6, it can be found that the center points of the target boxes are mostly concentrated in the central area of the image, whereas the width and height attributes are unevenly distributed, with mass at the maximum value, indicating the presence of large-sized target boxes.
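A small helper reflecting formula (13) (a hypothetical utility; the normalization by image size follows the description above):

```python
def corners_to_center(x_min, y_min, x_max, y_max, width, height):
    """Convert corner-format labels to normalized center/width/height format."""
    x_c = (x_min + x_max) / (2 * width)
    y_c = (y_min + y_max) / (2 * height)
    w = (x_max - x_min) / width
    h = (y_max - y_min) / height
    return x_c, y_c, w, h

# e.g. a 60 x 40 box at (20, 30) in a 200 x 200 NEU-DET image:
# corners_to_center(20, 30, 80, 70, 200, 200) -> (0.25, 0.25, 0.3, 0.2)
```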

Experimental Parameter Settings
The experiments are carried out on a PC with a 12th Gen Intel® Core™ i5-12400F processor, an NVIDIA GeForce RTX 3060 Ti GPU, CUDA 11.3, cuDNN 8.2.1, and the Windows 10 operating system. The experimental code is written and debugged in the PyCharm Python IDE, and the deep learning framework used is PyTorch. The training set contains 1440 images, the validation set 180 images, and the test set 180 images. The initial learning rate is set to 0.01, the Adam optimizer is used, the momentum is set to 0.937, the cosine learning rate decay method is used, and training runs for a total of 300 epochs. The IoU loss uses the GIoU loss function [73], and the confidence loss and classification loss use the binary cross-entropy loss function.
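A minimal sketch of this training configuration in PyTorch follows; mapping the reported momentum of 0.937 to Adam's beta1 is our assumption, the loss wiring is abbreviated, and a recent torchvision is assumed for the GIoU loss:

```python
import torch
import torch.nn as nn
from torchvision.ops import generalized_box_iou_loss  # GIoU loss [73]

def build_training_setup(model: nn.Module, epochs: int = 300):
    # Adam with initial lr 0.01; momentum 0.937 assumed to map to beta1.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.937, 0.999))
    # Cosine learning-rate decay over the 300 training epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    # Confidence and classification losses: binary cross-entropy.
    bce = nn.BCEWithLogitsLoss()
    return optimizer, scheduler, bce, generalized_box_iou_loss
```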

Evaluation Criteria
The experiments mainly evaluate the model for accuracy and lightweight design. AP (Average Precision) and mAP (mean Average Precision) are indicators used to evaluate the detection performance of a single category and of the entire object detection system, respectively; mAP is used as the final evaluation metric of model performance. We analyze the AP of each defect, mAP, model parameter count, theoretical floating-point operations (FLOPs), theoretical multiply-adds (MAdd), memory usage, and model storage size. For each category, all test images are sorted by confidence; each detection box is then considered in descending order of confidence and matched with the ground truth box according to the IoU value. In the experiments, we set the IoU threshold to 0.5. If the IoU value is greater than the threshold, the detection box is regarded as a true positive (TP); otherwise, it is regarded as a false positive (FP). Precision and recall are computed from these matches, and the AP and mAP values of each category are calculated as follows:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\, dR, \quad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where $N$ is the number of defect categories.
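For illustration, a NumPy sketch of the per-class AP computation under these definitions (the all-point interpolation scheme is our assumption, as the paper does not state it):

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """Per-class AP at IoU 0.5: sort detections by confidence, accumulate
    TP/FP, and integrate precision over recall (all-point interpolation)."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Make precision monotonically non-increasing from right to left.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))

# mAP is the mean of the per-class AP values over the six defect categories.
```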


Data Augmentation Strategies
We use a detection network without any data augmentation strategy as the baseline and consider the impact of the Mixup [74] and Mosaic [28] data augmentation strategies on detection accuracy. In Table 1, we can see that the Mixup strategy does not improve accuracy, while the Mosaic strategy brings a 3.56% increase in mAP. The Mixup strategy randomly selects two images from each batch and blends them in a certain ratio to generate a new image; this virtual blending introduces noise. Mosaic augmentation stitches four different images into a new sample, improving the diversity and representativeness of the data and helping to improve the generalization ability of the model. This paper therefore adopts the Mosaic strategy as the data augmentation method for the experiments.
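For reference, a minimal sketch of the Mixup blending step described above (the Beta-distribution parameter and array-based image representation are illustrative assumptions):

```python
import numpy as np

def mixup(img1: np.ndarray, img2: np.ndarray, alpha: float = 1.0):
    """Blend two images with a Beta-sampled ratio [74];
    labels are blended with the same lam."""
    lam = np.random.beta(alpha, alpha)
    return lam * img1 + (1.0 - lam) * img2, lam
```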

DCM and DSM
We evaluate the performance of DCM, DSM, and plain convolution on different detection head tasks; the neck network here uses the DCAM module. The proposed DCM and DSM fuse channel and spatial attention, respectively, aiming to give different detection tasks different attention regions. In Table 2, when the classification task branch uses DSM and the regression task branch uses DCM, the highest mAP is achieved on the NEU-DET dataset. DCM pays more attention to defects in the crazing and rolled-in scales categories, while DSM pays more attention to defects in the inclusion and patches categories. This result shows that channel attention can capture features such as texture, while spatial attention can capture features such as target positions and edges. The heat map visualizes the prediction results of the model for each pixel, usually by color coding the confidence of different pixels: pixels with higher confidence are shown in warmer tones such as red, and pixels with lower confidence in cooler tones such as blue. The heat map also helps us analyze the detection results of the model, such as judging which areas are easy to detect and which areas are easily overlooked. Figure 7 shows the heatmaps that visualize the fusion of the DCM and DSM modules across different detection head tasks. When DSM is added to the regression head, the model can accurately locate the defect position; when the classification head adds the DCM module, the model can discard redundant information. In general, the DSM module attends more to the spatial position of the target, and the DCM module better removes redundant information. The heat map visualization thus shows that DCM and DSM serve to accurately locate defect positions and remove redundant information.
We have also performed heat map visualization analysis on images containing both scratch and inclusion defects (Figure 8). The yellow ellipse marks the region of interest of the large-object heatmap, the black triangle marks the region of interest of the heatmap for objects with fuzzy defect features, and the red rectangle marks the region of interest of the heatmap containing the category. From the detection results, we can find that when one of the detection heads uses a plain convolution structure, the model does not detect tiny defects. When DSM and DCM are used in combination, the model can more accurately identify the defect location and remove redundant defect information. When the classification detection head uses DSM and the regression detection head uses DCM, the model pays more attention to small defects, while its attention to other defects remains accurate and effective.

Effect of Different Modules
To evaluate the impact of specific modules of the proposed DAssd-Net structure on detection accuracy, we conduct experiments on the modules separately. In Table 3, when the DCAM module is used in the feature fusion structure instead of CSPLayer, mAP increases by 2.58%. This shows that the DCAM module effectively increases the receptive field of the model and helps it better capture target context information, thereby improving detection performance. Integrating the DSM and DCM modules into the detection head further improves feature capture and model accuracy. We present the heatmap visualization results in Figure 9. In terms of model size, although DAssd-Net replaces ordinary convolution structures with more complex, parameter-rich modules, the overall model size is significantly reduced.

Comparison with Other Models
We have compared the proposed DAssd-Net model with other mainstream object detection models. These models include CenterNet [75], YOLOv5 [76], YOLOv5-v6.1 [77], YOLOv7 [29], and YOLOv8. The experiments are conducted using the same equipment and training strategy to compare the performance of steel surface defect detection in terms of accuracy and lightweight on the NEU-DET dataset. The heat map visualization results of different models are shown in Figure 10. Our proposed model can accurately identify the location of the target area with almost no redundant attention information. Other models do not pay enough attention to the center of the object.
Table 4 compares the AP of each defect type and the overall mAP across the different models. The proposed DAssd-Net achieves the highest mAP, and except for the rolled-in scales category, it achieves the highest accuracy in every defect category. Compared with the newly proposed YOLOv8 model, our model improves accuracy by 4.69%. Table 5 compares the different models in terms of parameter count, FLOPs, MAdd, memory usage, and model storage size. It shows that the proposed DAssd-Net is superior to the other models in parameter count and model size, achieving a lightweight model. In terms of model complexity, DAssd-Net also offers lower computational complexity and higher operating efficiency. The final detection results are shown in Figure 11, comparing our proposed model with current mainstream object detection models. The crazing and rolled-in scale defects are missed by most detection models; the YOLOv8 and YOLOv7 models detect only one target of the crazing category, while our model detects two defect locations. For small target defects, such as the inclusion and patches categories, other models exhibit missed detections, while our model detects them. The results show that our proposed model can accurately identify defect locations.
Figure 11. Comparison of the detection results between the proposed model and the mainstream target detection models.

Conclusions
A lightweight steel surface defect detection model, DAssd-Net, is proposed. The model uses a multi-branch Dilated Convolution Aggregation Module (DCAM), which effectively expands the receptive field and enhances contextual information fusion. Through experiments and heat map analysis, we find that:

• In the ablation experiments where DCM and DSM act on different detection head tasks, the model achieves its highest accuracy of 81.97% when DSM is fused with the classification detection head and DCM is fused with the regression detection head. The heat maps show that the model attends to the spatial position of the target, is more sensitive to large targets, and suppresses redundant channel information.
• Compared with other mainstream target detection models, the proposed DAssd-Net achieves 81.97% mAP accuracy on the NEU-DET dataset with a model size of only 18.7 MB. Its mAP is 4.69% higher than that of the latest YOLOv8 model, and its model size is 23.9 MB smaller, giving it a clear lightweight advantage.


Prospects
• The dilation rate of the dilated convolution determines the receptive field size of the convolution kernel, but different dilation rates correspond to different receptive field sizes, and the rate currently needs to be tuned manually to obtain the optimal receptive field. In future work, we will study a dilation-rate structure that can be adaptively designed according to the dataset, to improve the accuracy of receptive field sizing.
• The proposed model performs well on the PC side, but deployment to devices with limited computing resources must be considered in actual production. In future research, techniques such as model compression, model quantization, and knowledge distillation will be needed to deploy the model and meet the requirements of real-time, reliable steel surface defect detection.